Task 1: Comparing sentences

1. Read the sentences from the file. Convert them to lowercase.


In [288]:
with open('sentences.txt', 'r') as fileSentences:
    dataSentences = list(fileSentences)

In [289]:
for line in dataSentences:
    print line


In comparison to dogs, cats have not undergone major changes during the domestication process.

As cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes.

A common interactive use of cat for a single file is to output the content of a file to standard output.

Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals.

In one, people deliberately tamed cats in a process of artificial selection, as they were useful predators of vermin.

The domesticated cat and its closest wild ancestor are both diploid organisms that possess 38 chromosomes and roughly 20,000 genes.

Domestic cats are similar in size to the other members of the genus Felis, typically weighing between 4 and 5 kg (8.8 and 11.0 lb).

However, if the output is piped or redirected, cat is unnecessary.

cat with one named file is safer where human error is a concern - one wrong use of the default redirection symbol ">" instead of "<" (often adjacent on keyboards) may permanently delete the file you were just needing to read.

In terms of legibility, a sequence of commands starting with cat and connected by pipes has a clear left-to-right flow of information.

Cat command is one of the basic commands that you learned when you started in the Unix / Linux world.

Using cat command, the lines received from stdin can be redirected to a new file using redirection symbols.

When you type simply cat command without any arguments, it just receives the stdin content and displays it in the stdout.

Leopard was released on October 26, 2007 as the successor of Tiger (version 10.4), and is available in two editions.

According to Apple, Leopard contains over 300 changes and enhancements over its predecessor, Mac OS X Tiger.

As of Mid 2010, some Apple computers have firmware factory installed which will no longer allow installation of Mac OS X Leopard.

Since Apple moved to using Intel processors in their computers, the OSx86 community has developed and now also allows Mac OS X Tiger and later releases to be installed on non-Apple x86-based computers.

OS X Mountain Lion was released on July 25, 2012 for purchase and download through Apple's Mac App Store, as part of a switch to releasing OS X versions online and every year.

Apple has released a small patch for the three most recent versions of Safari running on OS X Yosemite, Mavericks, and Mountain Lion.

The Mountain Lion release marks the second time Apple has offered an incremental upgrade, rather than releasing a new cat entirely.

Mac OS X Mountain Lion installs in place, so you won't need to create a separate disk or run the installation off an external drive.

The fifth major update to Mac OS X, Leopard, contains such a mountain of features - more than 300 by Apple's count.


In [290]:
# Lowercase every sentence; the trailing newline of each line is kept as-is.
dataSentencesLower = [line.lower() for line in dataSentences]

In [291]:
for line in dataSentencesLower:
    print line


in comparison to dogs, cats have not undergone major changes during the domestication process.

as cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes.

a common interactive use of cat for a single file is to output the content of a file to standard output.

cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals.

in one, people deliberately tamed cats in a process of artificial selection, as they were useful predators of vermin.

the domesticated cat and its closest wild ancestor are both diploid organisms that possess 38 chromosomes and roughly 20,000 genes.

domestic cats are similar in size to the other members of the genus felis, typically weighing between 4 and 5 kg (8.8 and 11.0 lb).

however, if the output is piped or redirected, cat is unnecessary.

cat with one named file is safer where human error is a concern - one wrong use of the default redirection symbol ">" instead of "<" (often adjacent on keyboards) may permanently delete the file you were just needing to read.

in terms of legibility, a sequence of commands starting with cat and connected by pipes has a clear left-to-right flow of information.

cat command is one of the basic commands that you learned when you started in the unix / linux world.

using cat command, the lines received from stdin can be redirected to a new file using redirection symbols.

when you type simply cat command without any arguments, it just receives the stdin content and displays it in the stdout.

leopard was released on october 26, 2007 as the successor of tiger (version 10.4), and is available in two editions.

according to apple, leopard contains over 300 changes and enhancements over its predecessor, mac os x tiger.

as of mid 2010, some apple computers have firmware factory installed which will no longer allow installation of mac os x leopard.

since apple moved to using intel processors in their computers, the osx86 community has developed and now also allows mac os x tiger and later releases to be installed on non-apple x86-based computers.

os x mountain lion was released on july 25, 2012 for purchase and download through apple's mac app store, as part of a switch to releasing os x versions online and every year.

apple has released a small patch for the three most recent versions of safari running on os x yosemite, mavericks, and mountain lion.

the mountain lion release marks the second time apple has offered an incremental upgrade, rather than releasing a new cat entirely.

mac os x mountain lion installs in place, so you won't need to create a separate disk or run the installation off an external drive.

the fifth major update to mac os x, leopard, contains such a mountain of features - more than 300 by apple's count.
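
The reading and lowercasing cells above can be folded into one helper. The sketch below is only an illustration: `read_sentences_lower` and the temporary demo file are hypothetical names, not part of the notebook. Unlike the cells above, it strips the trailing newline and skips blank lines, which is an assumption about what later steps need.

```python
import os
import tempfile


def read_sentences_lower(path):
    """Return the non-empty lines of `path`, stripped and lowercased."""
    with open(path, 'r') as f:
        return [line.strip().lower() for line in f if line.strip()]


# Demo on a throwaway file holding two sentences from the output above.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("However, if the output is piped or redirected, cat is unnecessary.\n")
    tmp.write("\n")  # blank line, dropped by the helper
    tmp.write("According to Apple, Leopard contains over 300 changes and "
              "enhancements over its predecessor, Mac OS X Tiger.\n")
    demo_path = tmp.name

demo_sentences = read_sentences_lower(demo_path)
os.remove(demo_path)

for sentence in demo_sentences:
    print(sentence)
```

Stripping the newline here means a plain `print` no longer emits the extra blank line visible between sentences in the output above.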

2. Tokenize (split the texts into words). Remove empty words.


In [292]:
import re

In [293]:
dataWords = []
for line in dataSentencesLower:
    dataWords.append(re.split('[^a-z]', line))
    
for line in dataWords:
    print line


['in', 'comparison', 'to', 'dogs', '', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process', '', '']
['as', 'cat', 'simply', 'catenates', 'streams', 'of', 'bytes', '', 'it', 'can', 'be', 'also', 'used', 'to', 'concatenate', 'binary', 'files', '', 'where', 'it', 'will', 'just', 'concatenate', 'sequence', 'of', 'bytes', '', '']
['a', 'common', 'interactive', 'use', 'of', 'cat', 'for', 'a', 'single', 'file', 'is', 'to', 'output', 'the', 'content', 'of', 'a', 'file', 'to', 'standard', 'output', '', '']
['cats', 'can', 'hear', 'sounds', 'too', 'faint', 'or', 'too', 'high', 'in', 'frequency', 'for', 'human', 'ears', '', 'such', 'as', 'those', 'made', 'by', 'mice', 'and', 'other', 'small', 'animals', '', '']
['in', 'one', '', 'people', 'deliberately', 'tamed', 'cats', 'in', 'a', 'process', 'of', 'artificial', 'selection', '', 'as', 'they', 'were', 'useful', 'predators', 'of', 'vermin', '', '']
['the', 'domesticated', 'cat', 'and', 'its', 'closest', 'wild', 'ancestor', 'are', 'both', 'diploid', 'organisms', 'that', 'possess', '', '', '', 'chromosomes', 'and', 'roughly', '', '', '', '', '', '', '', 'genes', '', '']
['domestic', 'cats', 'are', 'similar', 'in', 'size', 'to', 'the', 'other', 'members', 'of', 'the', 'genus', 'felis', '', 'typically', 'weighing', 'between', '', '', 'and', '', '', 'kg', '', '', '', '', '', 'and', '', '', '', '', '', 'lb', '', '', '']
['however', '', 'if', 'the', 'output', 'is', 'piped', 'or', 'redirected', '', 'cat', 'is', 'unnecessary', '', '']
['cat', 'with', 'one', 'named', 'file', 'is', 'safer', 'where', 'human', 'error', 'is', 'a', 'concern', '', '', 'one', 'wrong', 'use', 'of', 'the', 'default', 'redirection', 'symbol', '', '', '', '', 'instead', 'of', '', '', '', '', '', 'often', 'adjacent', 'on', 'keyboards', '', 'may', 'permanently', 'delete', 'the', 'file', 'you', 'were', 'just', 'needing', 'to', 'read', '', '']
['in', 'terms', 'of', 'legibility', '', 'a', 'sequence', 'of', 'commands', 'starting', 'with', 'cat', 'and', 'connected', 'by', 'pipes', 'has', 'a', 'clear', 'left', 'to', 'right', 'flow', 'of', 'information', '', '']
['cat', 'command', 'is', 'one', 'of', 'the', 'basic', 'commands', 'that', 'you', 'learned', 'when', 'you', 'started', 'in', 'the', 'unix', '', '', 'linux', 'world', '', '']
['using', 'cat', 'command', '', 'the', 'lines', 'received', 'from', 'stdin', 'can', 'be', 'redirected', 'to', 'a', 'new', 'file', 'using', 'redirection', 'symbols', '', '']
['when', 'you', 'type', 'simply', 'cat', 'command', 'without', 'any', 'arguments', '', 'it', 'just', 'receives', 'the', 'stdin', 'content', 'and', 'displays', 'it', 'in', 'the', 'stdout', '', '']
['leopard', 'was', 'released', 'on', 'october', '', '', '', '', '', '', '', '', '', 'as', 'the', 'successor', 'of', 'tiger', '', 'version', '', '', '', '', '', '', '', 'and', 'is', 'available', 'in', 'two', 'editions', '', '']
['according', 'to', 'apple', '', 'leopard', 'contains', 'over', '', '', '', '', 'changes', 'and', 'enhancements', 'over', 'its', 'predecessor', '', 'mac', 'os', 'x', 'tiger', '', '']
['as', 'of', 'mid', '', '', '', '', '', '', 'some', 'apple', 'computers', 'have', 'firmware', 'factory', 'installed', 'which', 'will', 'no', 'longer', 'allow', 'installation', 'of', 'mac', 'os', 'x', 'leopard', '', '']
['since', 'apple', 'moved', 'to', 'using', 'intel', 'processors', 'in', 'their', 'computers', '', 'the', 'osx', '', '', 'community', 'has', 'developed', 'and', 'now', 'also', 'allows', 'mac', 'os', 'x', 'tiger', 'and', 'later', 'releases', 'to', 'be', 'installed', 'on', 'non', 'apple', 'x', '', '', 'based', 'computers', '', '']
['os', 'x', 'mountain', 'lion', 'was', 'released', 'on', 'july', '', '', '', '', '', '', '', '', '', 'for', 'purchase', 'and', 'download', 'through', 'apple', 's', 'mac', 'app', 'store', '', 'as', 'part', 'of', 'a', 'switch', 'to', 'releasing', 'os', 'x', 'versions', 'online', 'and', 'every', 'year', '', '']
['apple', 'has', 'released', 'a', 'small', 'patch', 'for', 'the', 'three', 'most', 'recent', 'versions', 'of', 'safari', 'running', 'on', 'os', 'x', 'yosemite', '', 'mavericks', '', 'and', 'mountain', 'lion', '', '']
['the', 'mountain', 'lion', 'release', 'marks', 'the', 'second', 'time', 'apple', 'has', 'offered', 'an', 'incremental', 'upgrade', '', 'rather', 'than', 'releasing', 'a', 'new', 'cat', 'entirely', '', '']
['mac', 'os', 'x', 'mountain', 'lion', 'installs', 'in', 'place', '', 'so', 'you', 'won', 't', 'need', 'to', 'create', 'a', 'separate', 'disk', 'or', 'run', 'the', 'installation', 'off', 'an', 'external', 'drive', '', '']
['the', 'fifth', 'major', 'update', 'to', 'mac', 'os', 'x', '', 'leopard', '', 'contains', 'such', 'a', 'mountain', 'of', 'features', '', '', 'more', 'than', '', '', '', '', 'by', 'apple', 's', 'count', '', '']

In [294]:
# Drop the empty strings that re.split leaves between adjacent separators.
dataWordsCleared = [[word for word in line if word != ''] for line in dataWords]

In [295]:
for line in dataWordsCleared:
    print line


['in', 'comparison', 'to', 'dogs', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process']
['as', 'cat', 'simply', 'catenates', 'streams', 'of', 'bytes', 'it', 'can', 'be', 'also', 'used', 'to', 'concatenate', 'binary', 'files', 'where', 'it', 'will', 'just', 'concatenate', 'sequence', 'of', 'bytes']
['a', 'common', 'interactive', 'use', 'of', 'cat', 'for', 'a', 'single', 'file', 'is', 'to', 'output', 'the', 'content', 'of', 'a', 'file', 'to', 'standard', 'output']
['cats', 'can', 'hear', 'sounds', 'too', 'faint', 'or', 'too', 'high', 'in', 'frequency', 'for', 'human', 'ears', 'such', 'as', 'those', 'made', 'by', 'mice', 'and', 'other', 'small', 'animals']
['in', 'one', 'people', 'deliberately', 'tamed', 'cats', 'in', 'a', 'process', 'of', 'artificial', 'selection', 'as', 'they', 'were', 'useful', 'predators', 'of', 'vermin']
['the', 'domesticated', 'cat', 'and', 'its', 'closest', 'wild', 'ancestor', 'are', 'both', 'diploid', 'organisms', 'that', 'possess', 'chromosomes', 'and', 'roughly', 'genes']
['domestic', 'cats', 'are', 'similar', 'in', 'size', 'to', 'the', 'other', 'members', 'of', 'the', 'genus', 'felis', 'typically', 'weighing', 'between', 'and', 'kg', 'and', 'lb']
['however', 'if', 'the', 'output', 'is', 'piped', 'or', 'redirected', 'cat', 'is', 'unnecessary']
['cat', 'with', 'one', 'named', 'file', 'is', 'safer', 'where', 'human', 'error', 'is', 'a', 'concern', 'one', 'wrong', 'use', 'of', 'the', 'default', 'redirection', 'symbol', 'instead', 'of', 'often', 'adjacent', 'on', 'keyboards', 'may', 'permanently', 'delete', 'the', 'file', 'you', 'were', 'just', 'needing', 'to', 'read']
['in', 'terms', 'of', 'legibility', 'a', 'sequence', 'of', 'commands', 'starting', 'with', 'cat', 'and', 'connected', 'by', 'pipes', 'has', 'a', 'clear', 'left', 'to', 'right', 'flow', 'of', 'information']
['cat', 'command', 'is', 'one', 'of', 'the', 'basic', 'commands', 'that', 'you', 'learned', 'when', 'you', 'started', 'in', 'the', 'unix', 'linux', 'world']
['using', 'cat', 'command', 'the', 'lines', 'received', 'from', 'stdin', 'can', 'be', 'redirected', 'to', 'a', 'new', 'file', 'using', 'redirection', 'symbols']
['when', 'you', 'type', 'simply', 'cat', 'command', 'without', 'any', 'arguments', 'it', 'just', 'receives', 'the', 'stdin', 'content', 'and', 'displays', 'it', 'in', 'the', 'stdout']
['leopard', 'was', 'released', 'on', 'october', 'as', 'the', 'successor', 'of', 'tiger', 'version', 'and', 'is', 'available', 'in', 'two', 'editions']
['according', 'to', 'apple', 'leopard', 'contains', 'over', 'changes', 'and', 'enhancements', 'over', 'its', 'predecessor', 'mac', 'os', 'x', 'tiger']
['as', 'of', 'mid', 'some', 'apple', 'computers', 'have', 'firmware', 'factory', 'installed', 'which', 'will', 'no', 'longer', 'allow', 'installation', 'of', 'mac', 'os', 'x', 'leopard']
['since', 'apple', 'moved', 'to', 'using', 'intel', 'processors', 'in', 'their', 'computers', 'the', 'osx', 'community', 'has', 'developed', 'and', 'now', 'also', 'allows', 'mac', 'os', 'x', 'tiger', 'and', 'later', 'releases', 'to', 'be', 'installed', 'on', 'non', 'apple', 'x', 'based', 'computers']
['os', 'x', 'mountain', 'lion', 'was', 'released', 'on', 'july', 'for', 'purchase', 'and', 'download', 'through', 'apple', 's', 'mac', 'app', 'store', 'as', 'part', 'of', 'a', 'switch', 'to', 'releasing', 'os', 'x', 'versions', 'online', 'and', 'every', 'year']
['apple', 'has', 'released', 'a', 'small', 'patch', 'for', 'the', 'three', 'most', 'recent', 'versions', 'of', 'safari', 'running', 'on', 'os', 'x', 'yosemite', 'mavericks', 'and', 'mountain', 'lion']
['the', 'mountain', 'lion', 'release', 'marks', 'the', 'second', 'time', 'apple', 'has', 'offered', 'an', 'incremental', 'upgrade', 'rather', 'than', 'releasing', 'a', 'new', 'cat', 'entirely']
['mac', 'os', 'x', 'mountain', 'lion', 'installs', 'in', 'place', 'so', 'you', 'won', 't', 'need', 'to', 'create', 'a', 'separate', 'disk', 'or', 'run', 'the', 'installation', 'off', 'an', 'external', 'drive']
['the', 'fifth', 'major', 'update', 'to', 'mac', 'os', 'x', 'leopard', 'contains', 'such', 'a', 'mountain', 'of', 'features', 'more', 'than', 'by', 'apple', 's', 'count']
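
Splitting and filtering can also be expressed as one small function. This is a minimal sketch; `tokenize` is a hypothetical helper name. It uses the same `[^a-z]` separator pattern as the cell above and shows that `re.findall` over runs of letters gives the same tokens in one step.

```python
import re


def tokenize(sentence):
    """Split a sentence into lowercase alphabetic tokens, dropping empty strings."""
    return [word for word in re.split('[^a-z]', sentence.lower()) if word]


sample = "However, if the output is piped or redirected, cat is unnecessary."
tokens = tokenize(sample)
print(tokens)
# ['however', 'if', 'the', 'output', 'is', 'piped', 'or', 'redirected', 'cat', 'is', 'unnecessary']

# Matching maximal runs of letters is equivalent to split-then-filter.
same_as_findall = tokens == re.findall('[a-z]+', sample.lower())
print(same_as_findall)  # True
```

As in the output above, digits and punctuation act purely as separators, so "won't" becomes the two tokens `won` and `t`.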

3. Build a list of all words.


In [296]:
dictWords = {}

i = 0
for line in dataWordsCleared:
    for word in line:
        if word not in dictWords:
            dictWords[word] = i
            i += 1
            
for item in dictWords:
    print item, ": ", dictWords[item]


displays :  155
osx :  192
selection :  67
safari :  221
just :  31
developed :  194
over :  170
vermin :  72
domestic :  87
named :  104
installed :  181
symbols :  149
through :  206
human :  51
world :  142
disk :  244
its :  74
fifth :  249
features :  251
tamed :  65
upgrade :  232
lb :  97
drive :  248
to :  2
won :  239
deliberately :  64
marks :  226
has :  129
predecessor :  172
non :  199
which :  182
read :  122
october :  160
every :  215
os :  174
they :  68
not :  6
during :  10
now :  195
possess :  83
intel :  189
keyboards :  116
bytes :  20
unnecessary :  102
patch :  217
predators :  71
small :  60
output :  41
entirely :  235
where :  29
ears :  52
available :  164
on :  115
often :  113
sequence :  32
some :  177
lion :  202
frequency :  50
are :  78
year :  216
download :  205
terms :  123
concern :  107
error :  106
for :  37
pipes :  128
since :  187
factory :  180
artificial :  66
content :  42
version :  163
run :  245
between :  95
new :  148
learned :  137
three :  218
piped :  100
common :  34
concatenate :  26
be :  23
weighing :  94
genes :  86
use :  36
standard :  43
release :  225
diploid :  80
members :  90
x :  175
based :  200
safer :  105
by :  56
both :  79
commands :  125
installation :  186
installs :  236
of :  19
needing :  121
allows :  196
according :  167
july :  203
later :  197
mac :  173
s :  207
streams :  18
receives :  154
successor :  161
catenates :  17
changes :  9
or :  48
felis :  92
major :  8
faint :  47
useful :  70
apple :  168
app :  208
community :  193
one :  62
running :  222
unix :  140
right :  132
simply :  16
linux :  141
sounds :  45
size :  89
undergone :  7
delete :  119
from :  146
enhancements :  171
second :  227
their :  191
create :  242
people :  63
two :  165
t :  240
redirection :  110
however :  98
cats :  4
too :  46
basic :  136
permanently :  118
type :  150
dogs :  3
store :  209
more :  252
files :  28
releases :  198
that :  82
started :  139
contains :  169
releasing :  212
tiger :  162
released :  159
part :  210
hear :  44
external :  247
editions :  166
off :  246
mice :  57
with :  103
than :  234
those :  54
longer :  184
count :  253
made :  55
animals :  61
mavericks :  224
versions :  213
default :  109
was :  158
single :  38
cat :  15
will :  30
can :  22
were :  69
wild :  76
similar :  88
interactive :  35
and :  58
mountain :  201
computers :  178
have :  5
stdout :  156
process :  13
lines :  144
is :  40
received :  145
moved :  188
it :  21
an :  230
high :  49
as :  14
incremental :  231
file :  39
in :  0
need :  241
domesticated :  73
any :  152
domestication :  12
if :  99
binary :  27
processors :  190
no :  183
rather :  233
legibility :  124
separate :  243
firmware :  179
when :  138
mid :  176
also :  24
other :  59
arguments :  153
adjacent :  114
online :  214
instead :  112
you :  120
ancestor :  77
offered :  229
used :  25
chromosomes :  84
closest :  75
information :  134
may :  117
symbol :  111
leopard :  157
update :  250
most :  219
wrong :  108
connected :  127
yosemite :  223
such :  53
comparison :  1
recent :  220
a :  33
purchase :  204
genus :  91
kg :  96
organisms :  81
using :  143
starting :  126
clear :  130
stdin :  147
flow :  133
roughly :  85
so :  238
switch :  211
without :  151
command :  135
place :  237
allow :  185
time :  228
redirected :  101
the :  11
typically :  93
left :  131
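
The indices printed above follow the order in which each word first appears. A compact way to get the same numbering, sketched here with a hypothetical `build_vocabulary` helper, is to use the current dictionary size as the next index. The demo input is the first two tokenized sentences from step 2.

```python
def build_vocabulary(tokenized_sentences):
    """Map each distinct word to an index in order of first appearance."""
    vocab = {}
    for tokens in tokenized_sentences:
        for word in tokens:
            if word not in vocab:
                vocab[word] = len(vocab)  # next free index
    return vocab


first_two = [
    ['in', 'comparison', 'to', 'dogs', 'cats', 'have', 'not', 'undergone',
     'major', 'changes', 'during', 'the', 'domestication', 'process'],
    ['as', 'cat', 'simply', 'catenates', 'streams', 'of', 'bytes', 'it', 'can',
     'be', 'also', 'used', 'to', 'concatenate', 'binary', 'files', 'where',
     'it', 'will', 'just', 'concatenate', 'sequence', 'of', 'bytes'],
]

vocab = build_vocabulary(first_two)
print(len(vocab))         # 33
print(vocab['process'])   # 13
print(vocab['sequence'])  # 32
```

These indices agree with the corresponding entries in the output above (`process :  13`, `sequence :  32`).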

In [297]:
import pandas as pd

In [298]:
print len(dictWords)
# number of distinct words in the file


254

In [299]:
print dictWords.keys()


['displays', 'osx', 'selection', 'safari', 'just', 'developed', 'over', 'vermin', 'domestic', 'named', 'installed', 'symbols', 'through', 'human', 'world', 'disk', 'its', 'fifth', 'features', 'tamed', 'upgrade', 'lb', 'drive', 'to', 'won', 'deliberately', 'marks', 'has', 'predecessor', 'non', 'which', 'read', 'october', 'every', 'os', 'they', 'not', 'during', 'now', 'possess', 'intel', 'keyboards', 'bytes', 'unnecessary', 'patch', 'predators', 'small', 'output', 'entirely', 'where', 'ears', 'available', 'on', 'often', 'sequence', 'some', 'lion', 'frequency', 'are', 'year', 'download', 'terms', 'concern', 'error', 'for', 'pipes', 'since', 'factory', 'artificial', 'content', 'version', 'run', 'between', 'new', 'learned', 'three', 'piped', 'common', 'concatenate', 'be', 'weighing', 'genes', 'use', 'standard', 'release', 'diploid', 'members', 'x', 'based', 'safer', 'by', 'both', 'commands', 'installation', 'installs', 'of', 'needing', 'allows', 'according', 'july', 'later', 'mac', 's', 'streams', 'receives', 'successor', 'catenates', 'changes', 'or', 'felis', 'major', 'faint', 'useful', 'apple', 'app', 'community', 'one', 'running', 'unix', 'right', 'simply', 'linux', 'sounds', 'size', 'undergone', 'delete', 'from', 'enhancements', 'second', 'their', 'create', 'people', 'two', 't', 'redirection', 'however', 'cats', 'too', 'basic', 'permanently', 'type', 'dogs', 'store', 'more', 'files', 'releases', 'that', 'started', 'contains', 'releasing', 'tiger', 'released', 'part', 'hear', 'external', 'editions', 'off', 'mice', 'with', 'than', 'those', 'longer', 'count', 'made', 'animals', 'mavericks', 'versions', 'default', 'was', 'single', 'cat', 'will', 'can', 'were', 'wild', 'similar', 'interactive', 'and', 'mountain', 'computers', 'have', 'stdout', 'process', 'lines', 'is', 'received', 'moved', 'it', 'an', 'high', 'as', 'incremental', 'file', 'in', 'need', 'domesticated', 'any', 'domestication', 'if', 'binary', 'processors', 'no', 'rather', 'legibility', 'separate', 
'firmware', 'when', 'mid', 'also', 'other', 'arguments', 'adjacent', 'online', 'instead', 'you', 'ancestor', 'offered', 'used', 'chromosomes', 'closest', 'information', 'may', 'symbol', 'leopard', 'update', 'most', 'wrong', 'connected', 'yosemite', 'such', 'comparison', 'recent', 'a', 'purchase', 'genus', 'kg', 'organisms', 'using', 'starting', 'clear', 'stdin', 'flow', 'roughly', 'so', 'switch', 'without', 'command', 'place', 'allow', 'time', 'redirected', 'the', 'typically', 'left']

In [300]:
frameWords = pd.DataFrame(dictWords, xrange(len(dataWordsCleared)))
rowsCnt, colCnt = frameWords.shape
print rowsCnt, colCnt


22 254

In [301]:
# Reset every cell to zero in one assignment (.ix was removed in pandas 1.0).
frameWords.iloc[:, :] = 0
        
frameWords


Out[301]:
a according adjacent allow allows also an ancestor and animals ... will with without won world wrong x year yosemite you
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

22 rows × 254 columns


In [302]:
print len(dataWordsCleared)
for i in xrange(len(dataWordsCleared)):
    for word in dataWordsCleared[i]:
        frameWords.loc[i, word] += 1
        
frameWords


22
Out[302]:
a according adjacent allow allows also an ancestor and animals ... will with without won world wrong x year yosemite you
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
2 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 1 2 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 2 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0 0 0 ... 0 1 0 0 0 1 0 0 0 1
9 2 0 0 0 0 0 0 0 1 0 ... 0 1 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 2
11 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 0 1
13 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
15 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 1 0 0 0
16 0 0 0 0 1 1 0 0 2 0 ... 0 0 0 0 0 0 2 0 0 0
17 1 0 0 0 0 0 0 0 2 0 ... 0 0 0 0 0 0 2 1 0 0
18 1 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 1 0 1 0
19 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
20 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 1 0 0 1
21 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

22 rows × 254 columns
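
The same sentence-by-word count matrix can be built without pandas. The following is a rough sketch using plain lists; `count_matrix` is a hypothetical helper, and its columns follow first-appearance order rather than the alphabetical column order of the DataFrame above.

```python
def count_matrix(tokenized_sentences):
    """Return (vocab, matrix) where matrix[i][j] counts word j in sentence i."""
    vocab = {}
    for tokens in tokenized_sentences:
        for word in tokens:
            vocab.setdefault(word, len(vocab))  # index by first appearance

    matrix = [[0] * len(vocab) for _ in tokenized_sentences]
    for i, tokens in enumerate(tokenized_sentences):
        for word in tokens:
            matrix[i][vocab[word]] += 1
    return vocab, matrix


# Two tokenized sentences taken from the output of step 2.
sample = [
    ['however', 'if', 'the', 'output', 'is', 'piped', 'or', 'redirected',
     'cat', 'is', 'unnecessary'],
    ['a', 'common', 'interactive', 'use', 'of', 'cat', 'for', 'a', 'single',
     'file', 'is', 'to', 'output', 'the', 'content', 'of', 'a', 'file', 'to',
     'standard', 'output'],
]

vocab, matrix = count_matrix(sample)
print(matrix[1][vocab['a']])   # 3
print(matrix[0][vocab['is']])  # 2
print(sum(matrix[1]))          # 21, the number of tokens in the second sentence
```

Each row sums to the token count of its sentence, which is a cheap sanity check on the filled matrix.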

6. Find the cosine distance from the sentence in the very first line to all the others.


In [313]:
from scipy.spatial.distance import cosine

In [357]:
distanceFromFirstSentence = []

for i in xrange(rowsCnt):
    distanceFromFirstSentence.append(cosine(frameWords.iloc[0], frameWords.iloc[i]))
    
print distanceFromFirstSentence


[0.0, 0.95275444087384664, 0.86447381456421235, 0.89517151632780823, 0.77708871496985887, 0.94023856953328033, 0.7327387580875756, 0.92587506833388988, 0.88427248752843102, 0.90550888174769317, 0.83281653622739416, 0.88047713906656067, 0.83964325485254543, 0.87035925528956715, 0.87401184233025764, 0.94427217874246472, 0.84063618542208085, 0.95664450152379399, 0.94427217874246472, 0.88854435748492944, 0.84275727449171223, 0.82503644694405864]
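
`scipy.spatial.distance.cosine(u, v)` returns 1 - (u · v) / (‖u‖ ‖v‖). The sketch below recomputes that quantity with the standard library only, working on token lists instead of DataFrame rows; `cosine_distance` is a hypothetical helper. The inputs are the tokenized sentences 0 and 6 from step 2.

```python
import math
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


sentence_0 = ['in', 'comparison', 'to', 'dogs', 'cats', 'have', 'not',
              'undergone', 'major', 'changes', 'during', 'the',
              'domestication', 'process']
sentence_6 = ['domestic', 'cats', 'are', 'similar', 'in', 'size', 'to', 'the',
              'other', 'members', 'of', 'the', 'genus', 'felis', 'typically',
              'weighing', 'between', 'and', 'kg', 'and', 'lb']

distance_0_6 = cosine_distance(sentence_0, sentence_6)
print(round(distance_0_6, 6))  # 0.732739
```

The shared words are `cats`, `in`, `to`, and `the` (twice in sentence 6), so the dot product is 5, the norms are √14 and 5, and the distance is 1 - 1/√14 ≈ 0.7327, matching the entry at index 6 in the list above.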

In [358]:
distanceFromFirstSentenceCopy = list(distanceFromFirstSentence)
twoClosestValues = [[-1, 0], [-1, 0]]

distanceFromFirstSentenceCopy.remove(min(distanceFromFirstSentenceCopy))

for i in xrange(2):
    twoClosestValues[i][1] = min(distanceFromFirstSentenceCopy)
    for j in xrange(len(distanceFromFirstSentence)):
        if twoClosestValues[i][1] == distanceFromFirstSentence[j]:
            twoClosestValues[i][0] = j
    distanceFromFirstSentenceCopy.remove(min(distanceFromFirstSentenceCopy))

twoClosestValues = sorted(twoClosestValues)
twoClosestValues


Out[358]:
[[4, 0.77708871496985887], [6, 0.7327387580875756]]
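
Picking the two nearest sentences amounts to sorting the candidate indices by distance. This is one possible sketch, with `closest_indices` as a hypothetical helper; it skips index 0 (the first sentence itself) and reuses the distances printed in step 6.

```python
def closest_indices(distances, k=2):
    """Indices of the k smallest distances, ignoring index 0, in ascending index order."""
    candidates = range(1, len(distances))
    nearest = sorted(candidates, key=lambda i: distances[i])[:k]
    return sorted(nearest)


# Distances from the first sentence, copied from the output of step 6.
distances = [0.0, 0.95275444087384664, 0.86447381456421235, 0.89517151632780823,
             0.77708871496985887, 0.94023856953328033, 0.7327387580875756,
             0.92587506833388988, 0.88427248752843102, 0.90550888174769317,
             0.83281653622739416, 0.88047713906656067, 0.83964325485254543,
             0.87035925528956715, 0.87401184233025764, 0.94427217874246472,
             0.84063618542208085, 0.95664450152379399, 0.94427217874246472,
             0.88854435748492944, 0.84275727449171223, 0.82503644694405864]

print(closest_indices(distances))  # [4, 6]
```

Sorting indices rather than values avoids the second lookup pass and behaves predictably when two sentences happen to share the same distance.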

7. Write the answer to a file.


In [359]:
with open('answer.txt', 'w') as fileAnswer:
    for i in xrange(len(twoClosestValues)):
        fileAnswer.write(str(twoClosestValues[i][0]) + ' ')
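
The cell above leaves a trailing space after the last index. If the answer file is expected to contain exactly the space-separated indices (an assumption, since the expected format is not stated here), `str.join` avoids it. `write_answer` and the demo path are hypothetical names.

```python
import os
import tempfile


def write_answer(path, indices):
    """Write the indices to `path` as one space-separated line, no trailing space."""
    with open(path, 'w') as f:
        f.write(' '.join(str(i) for i in indices))


# Demo in a throwaway directory so no real answer file is touched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'answer_demo.txt')
write_answer(demo_path, [4, 6])

with open(demo_path) as f:
    written = f.read()
print(repr(written))  # '4 6'

os.remove(demo_path)
os.rmdir(demo_dir)
```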
